
User talk:Thisismyusername31



Welcome!

Hello, Thisismyusername31, and welcome to Wikipedia! My name is Ian and I work with Wiki Education; I help support students who are editing as part of a class assignment.

I hope you enjoy editing here. If you haven't already done so, please check out the student training library, which introduces you to editing and Wikipedia's core principles. You may also want to check out the Teahouse, a community of Wikipedia editors dedicated to helping new users. Below are some resources to help you get started editing.

  • You can find answers to many student questions in our FAQ.

If you have any questions, please don't hesitate to contact me on my talk page. Ian (Wiki Ed) (talk) 20:17, 18 February 2021 (UTC)

Peer Review - Denmum

Hello! This is Denmum. I enjoyed your data sanitization draft. One suggestion would be adding sources, and another would be elaborating more in some sections. Otherwise, great job!

Peer Review Week 8

Hello Thisismyusername31!

I think this is a really well-done article! The lead section is strong and gives a good overall sense of the article without going into too much detail right away. The lead sentence gives a concise definition of the topic. In your lead section, you do mention and go into detail about Internet of Things (IoT) technologies, yet they are only briefly mentioned later in the article. You might want to consider saving some of the specifics about this technology, such as "These devices are able to easily transfer information from one device to another... For example, devices like Alexa and Google Home...", to use later in the article, such as in the section about risks associated with this technology.

Under the "Applications of data sanitization" section you use the terminology "blockchain" a lot, but I do not see a clear definition of the term. I think this would be helpful to include so that the reader can have a better understanding, or you could link to a longer Wikipedia article about it. I would also consider change contractions such as "it's" to "it is" in order to sound more formal and encyclopedic, though this is a small detail.

In "Risks Associated" I found these two sentences to be a bit unclear "Some methods of data sanitization have a high sensitivity to distinct points that have no closeness to data points. This type of data sanitization is very precise and can detect anomalies even if the poisoned data point is relatively close to true data." You might want to define specifically the terms you use such as distinct point, data point, true data, and poisoned data. Remember that you are speaking to a general audience that may not have prior knowledge. Linking to these terms could also help the reader clarify anything they do not understand.

Overall this is a great, detailed, and well-organized article. Good job :)

Casademasa (talk) 07:17, 9 April 2021 (UTC)

week 8 edit

I really like what you've done so far; it's very comprehensive and includes a lot of detail. The sections are organized in a way that makes sense and the language is eloquent and professional. One thing I would definitely recommend would be adding your citations as you go along. Overall, though, it's a really good draft that definitely has the feel of a Wikipedia article!

redpandafan — Preceding unsigned comment added by Redpandafan (talk · contribs) 23:14, 9 April 2021 (UTC)

Peer Review from Junior Leadership Team

Lead: The first sentence should give a concise and clear definition. Replace "involve" with "is." IoT and 5G are not the central topics of your page on Data Sanitization, so you don't need to explain these two terms in your lead section. It is great that you mentioned the "Risks Associated" section in your lead, but this mention should be brief. You can move the example of Alexa and Google Home to the risks section. Overall, you did great work in explaining the concept. You also covered all the sub-sections in your lead.

Content: PPDM and Data Sanitization are equivalent, but I would suggest you use the latter as it is the name of this Wikipedia page. It is great that you have bullet points in "Applications of Data Sanitization." You can really expand on each bullet point and explain how Data Sanitization is applied in these different areas. This section is not detailed enough and needs more elaboration, but I can see the backbone. Your next section, "Risks Associated," is in excellent shape now. One small thing is that you don't have to include the citation and contributors in your Wikipedia paragraph when you refer to a study; add a citation instead. I really like the way you incorporate academic studies into your article. For the section "Methods of Data Sanitization," I can also see a solid and clear backbone and structure. Remember this is a Wikipedia article and should cover the details and important aspects of the subject. Instead of saying "many methods," you should list out the methods' names and explain how they are used and their roles in the scope of Data Sanitization.

Sources: I highly suggest you include citations alongside your writing. It is a good way to track your editing record and refer back to your sources.

Organization & Writing Quality: As mentioned above, your current organization is very clear, a solid foundation for a strong article. There are some grammatical issues that you should be aware of. If you would like, let me know, and I can send you a word doc with annotations indicating these grammatical issues.

In the next few weeks, please elaborate on your current sections. If you have other ideas, it would also be great to add a few new sections. Exploredragon (talk) 19:37, 16 April 2021 (UTC)

Week 9 Peer Review

Data sanitization involves (I wouldn't use the word "involves". On an article about data sanitization, I'm probably looking for someone to define the term for me and using "involves" doesn't tell me what data sanitization actually is) the process of permanently removing and hiding sensitive information during the usage of datasets for study or transfer of information from one device to another (I would reword or break up this sentence since it's not clear whether you're talking about two different applications of data sanitization or two different usages for datasets.). This technique is essential for taking useful information from original databases while avoiding infringing on private information that may be stored in these databases. In recent decades (a bit vague, I would give a concrete time period), there has been an increasing use of database information in the generation of electronic tools, such as 5G mobile data. There has also been increasing usage of Internet of Things (IoT) technologies. IoT technologies refer to smart devices equipped with sensors, cameras, recording devices, and other sensory tools that are then linked directly to other devices through the internet. These devices are able to easily transfer information from one device to another, however, this ease of transfer also poses a major privacy challenge. There are many risks associated with transferring information because sensitive, raw data needs to be removed in between. For example, devices like Alexa and Google Home need to be equipped with data sanitization tools that eliminate the leakage of private data that may be collected. Data sanitization is also commonly referred to as Privacy Preserving Data Mining, or PPDM, as it aims to preserve important information while using algorithms to filter out sensitive details. Currently, many models of data sanitization rely on heuristic methods that delete or add information to the original database in an effort to preserve the privacy of each subject. However, there have also been numerous new developments of PPDM that rely instead on machine learning and deep learning techniques. (I think the intro is good, but it talks about too many different things. I would put some of this stuff, such as IoT, into one of the other sections. The intro should introduce the topic of data sanitization. Any ancillary information can be left to other sections)

Applications of Data Sanitization

Privacy Preserving Data Mining (PPDM) has a wide range of uses and is an integral step in the transfer or use of any large data set. It is also commonly linked to blockchain-based secure information sharing within supply chain management systems.

5G data

Internet of Things (IoT) technologies eg: Alexa, Google Home, etc.

Healthcare industry, using large datasets

Supply chain industry, usage of blockchain and optimal key generation

(Maybe I missed something, but I'm not sure what this list is. I would clarify what you're listing here.)

Browser backed cloud storage systems are heavily reliant on data sanitization and are becoming an increasingly popular route of data storage. Furthermore, the ease of usage is important for enterprises and workplaces that use cloud storage for communication and collaboration. (Examples)

Data sanitization is especially relevant for the medical field or large public organizations that need to use very large databases of sensitive data. It's those organizations that need to find efficient ways to hide sensitive data while maintaining functionality. (Examples?)

Blockchain is used to record and transfer information in a secure way and data sanitization techniques are required to ensure that this data is transferred more securely and accurately. It’s especially applicable for those working in supply chain management and may be useful for those looking to optimize the supply chain process. The need to improve blockchain methods is becoming increasingly relevant as the global level of development increases and becomes more electronically dependent. (I would elaborate more on this)

Risks Associated

Inadequate data sanitization methods can result in two main problems: a breach of private information and compromises to the integrity of the original dataset. If data sanitization methods are unsuccessful at removing all sensitive information, it poses the risk of leaking this information to attackers. Numerous studies have been conducted to optimize ways of preserving sensitive information (This is a good place to cite one of those studies). Some methods of data sanitization have a high sensitivity to distinct points that have no closeness to data points. This type of data sanitization is very precise and can detect anomalies even if the poisoned data point is relatively close to true data. Another method of data sanitization is one that also removes outliers in data, but does so in a more general way (As someone not familiar with data sanitization, this explanation is a bit confusing). It detects the general trend of data and discards any data that strays and it’s able to target anomalies even when inserted as a group. In general, data sanitization techniques use algorithms to detect anomalies and remove any suspicious points that may be poisoned data or sensitive information.
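To make the two filtering behaviors described above concrete, here is a minimal sketch of the simpler, per-point variety in Python. The z-score rule, the threshold value, and the function name are illustrative assumptions on my part, not the specific techniques the article cites:

    import numpy as np

    def sanitize_outliers(data, threshold=3.0):
        # Drop any row that sits more than `threshold` standard deviations
        # from the column means, a crude stand-in for the precise,
        # per-point anomaly detection described above.
        data = np.asarray(data, dtype=float)
        z = np.abs((data - data.mean(axis=0)) / (data.std(axis=0) + 1e-12))
        return data[(z < threshold).all(axis=1)]

A rule like this scores each point in isolation, so a cluster of poisoned points inserted as a group can shift the statistics and slip through; that gap is what the second, trend-based method quoted above is meant to close.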


Furthermore, data sanitization methods may remove useful, non-sensitive information, which then renders the sanitized dataset less useful and altered from the original. There have been iterations of common data sanitization techniques that attempt to correct the issue of the loss of original dataset integrity. In particular, Liu, Xuan, Wen, and Song offered a new algorithm for data sanitization called the Improved Minimum Sensitive Itemsets Conflict First Algorithm (IMSICF) method. There is often a lot of emphasis that is put into protecting the privacy of users, so this method brings a new perspective that focuses on also protecting the integrity of the data. It functions in a way that has three main advantages: it learns to optimize the process of sanitization by only cleaning the item with the highest conflict count, keeps parts of the dataset with highest utility, and also analyzes the conflict degree of the sensitive material. Robust research was conducted on the efficacy and usefulness of this new technique to reveal the ways that it can benefit in maintaining the integrity of the dataset. This new technique is able to firstly pinpoint the specific parts of the dataset that are possibly poisoned data and also use computer algorithms to make a calculation between the tradeoffs of how useful it is to decide if it should be removed. This is a new way of data sanitization that takes into account the utility of the data before it is immediately discarded. (Maybe merge this paragraph with the previous? It seems to be continuing an idea from the previous)
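Since the IMSICF summary above stays abstract, a rough sketch of its central "conflict count" idea may help readers of this review. To be clear, this is a simplified greedy illustration, not the published algorithm, and every name in it is invented for the example:

    def conflict_count(item, sensitive_itemsets):
        # An item's conflict count: how many sensitive itemsets it appears in.
        return sum(item in s for s in sensitive_itemsets)

    def hide_sensitive(transactions, sensitive_itemsets, min_support):
        # Greedy sketch: while a sensitive itemset is still supported by at
        # least `min_support` transactions, delete its highest-conflict item
        # from one supporting transaction and leave all other rows untouched
        # (the "utility-preserving" intuition described above).
        for itemset in sensitive_itemsets:
            supporters = [t for t in transactions if itemset <= t]
            while len(supporters) >= min_support:
                victim = max(itemset, key=lambda i: conflict_count(i, sensitive_itemsets))
                supporters[0].discard(victim)
                supporters = [t for t in transactions if itemset <= t]
        return transactions

    # Tiny made-up usage: hide the itemset {bread, beer} once it appears
    # in two or more transactions.
    db = [{"bread", "beer"}, {"bread", "beer", "milk"}, {"milk"}]
    hide_sensitive(db, [frozenset({"bread", "beer"})], min_support=2)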

Methods of Data Sanitization

An important goal (There is probably a less subjective word to use here than "important." Remember, we should have an objective tone) of PPDM is to strike a balance between maintaining the privacy of users that have submitted the data while also enabling developers to make full use of the dataset. Many measures of PPDM directly modify the dataset and create a new version that makes the original unrecoverable. It strictly erases any sensitive information and makes it inaccessible for attackers.

One type of data sanitization is rule based PPDM that uses defined computer algorithms to clean datasets. Association rule hiding is the process of data sanitization as applied to transactional databases. Transactional databases are the general term for data storage used to record transactions as organizations conduct their business. Examples include shipping payments, credit card payments, and sales orders. This source analyzes fifty four different methods of data sanitization and presents its four major findings of its trends (add period)
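For readers unfamiliar with the "association rule hiding" term used above: rule mining only reports rules whose support and confidence clear preset thresholds, so sanitization hides a sensitive rule by pushing either measure under its threshold. A small sketch of the two measures, with the data and thresholds invented for the example:

    def support(itemset, transactions):
        # Fraction of transactions that contain every item in `itemset`.
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(antecedent, consequent, transactions):
        # Among transactions containing the antecedent, the fraction that
        # also contain the consequent.
        return support(antecedent | consequent, transactions) / support(antecedent, transactions)

    db = [{"bread", "milk"}, {"bread", "milk", "beer"}, {"bread"}, {"milk"}]
    # The rule {bread} -> {milk} is reported only while both measures clear
    # the (made-up) mining thresholds; hiding it means editing transactions
    # until one of these comparisons turns False.
    visible = (support({"bread", "milk"}, db) >= 0.5
               and confidence({"bread"}, {"milk"}, db) >= 0.6)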

Certain new methods of data sanitization that rely on machine deep learning. There are various weaknesses in the current use of data sanitization. Many methods are not intricate or detailed enough to protect against more specific data attacks. This effort to maintain privacy while dating important data is referred to as privacy-preserving data mining. Machine learning develops methods that are more adapted to different types of attacks and can learn to face a broader range of situations. Deep learning is able to simplify the data sanitization methods and run these protective measures in a more efficient and less time consuming way.
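The paragraph above credits machine learning with adaptability but names no concrete model, so purely as an illustration, here is how an off-the-shelf learned anomaly detector (scikit-learn's IsolationForest) could stand in for hand-written rules; the dataset and contamination rate are invented:

    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.RandomState(0)
    X = rng.normal(size=(500, 4))     # stand-in for a real dataset
    X[:5] += 8.0                      # a handful of injected "poisoned" rows

    # The forest learns what typical rows look like and flags the rest,
    # rather than relying on a fixed, hand-tuned rule.
    detector = IsolationForest(contamination=0.01, random_state=0)
    labels = detector.fit_predict(X)  # 1 = looks normal, -1 = anomaly
    X_clean = X[labels == 1]

Unlike the fixed per-point rule sketched earlier, a model like this can be retrained as new attack patterns appear, which is the broader adaptability the paragraph describes.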

There have also been hybrid models that utilize both rule based and machine deep learning methods to achieve a balance between the two techniques. (Examples?)

Overall, I like what you have so far. There aren't any major issues with your article. Most of it is clear and interesting. There are a few instances of vague wording, and there is definitely more you can add to this article since you bring up many topics but don't have any examples. I would also define data sanitization more clearly and clarify how what you are discussing in each section relates to it. I look forward to seeing how this turns out. Good work so far! — Preceding unsigned comment added by Superunsubscriber (talk · contribs) 22:18, 17 April 2021 (UTC)

Week 9 Peer Review - Denmum

Data sanitization involves the process of permanently removing and hiding sensitive information during the usage of datasets for study or transfer of information from one device to another. This technique is essential for taking useful information from original databases while avoiding infringing on private information that may be stored in these databases. In recent decades, there has been an increasing use of database information in the generation of electronic tools, such as 5G mobile data. There has also been increasing usage of Internet of Things (IoT) technologies. IoT technologies refer to smart devices equipped with sensors, cameras, recording devices, and other sensory tools that are then linked directly to other devices through the internet. These devices are able to easily transfer information from one device to another, however, this ease of transfer also poses a major privacy challenge. There are many risks associated with transferring information because sensitive, raw data needs to be removed in between. For example, devices like Alexa and Google Home need to be equipped with data sanitization tools that eliminate the leakage of private data that may be collected. Data sanitization is also commonly referred to as Privacy Preserving Data Mining, or PPDM, as it aims to preserve important information while using algorithms to filter out sensitive details. Currently, many models of data sanitization rely on heuristic methods that delete or add information to the original database in an effort to preserve the privacy of each subject. However, there have also been numerous new developments of PPDM that rely instead on machine learning and deep learning techniques.

I like this section as it maintains the objective tone of Wikipedia, while relaying important background and introductory information about the topic.

Applications of Data Sanitization

Privacy Preserving Data Mining (PPDM) has a wide range of uses and is an integral step in the transfer or use of any large data set. It is also commonly linked to blockchain-based secure information sharing within supply chain management systems.

5G data

Internet of Things (IoT) technologies eg: Alexa, Google Home, etc.

Healthcare industry, using large datasets

Supply chain industry, usage of blockchain and optimal key generation

Formatting is a bit weird here; maybe add a colon before the list and describe what specifically this list is?

Browser backed cloud storage systems are heavily reliant on data sanitization and are becoming an increasingly popular route of data storage. Furthermore, the ease of usage is important for enterprises and workplaces that use cloud storage for communication and collaboration.

Data sanitization is especially relevant for the medical field or large public organizations that need to use very large databases of sensitive data. It's those organizations that need to find efficient ways to hide sensitive data while maintaining functionality. (Sourcing?)

Blockchain is used to record and transfer information in a secure way and data sanitization techniques are required to ensure that this data is transferred more securely and accurately. It’s especially applicable for those working in supply chain management and may be useful for those looking to optimize the supply chain process. The need to improve blockchain methods is becoming increasingly relevant as the global level of development increases and becomes more electronically dependent.

Overall, a good section. Maintains tone, describes applications of data sanitization. Needs sourcing in some places.

Risks Associated

Inadequate data sanitization methods can result in two main problems: a breach of private information and compromises to the integrity of the original dataset. If data sanitization methods are unsuccessful at removing all sensitive information, it poses the risk of leaking this information to attackers. Numerous studies have been conducted to optimize ways of preserving sensitive information.(Such as?) Some methods of data sanitization have a high sensitivity to distinct points that have no closeness to data points. This type of data sanitization is very precise and can detect anomalies even if the poisoned data point is relatively close to true data. Another method of data sanitization is one that also removes outliers in data, but does so in a more general way. It detects the general trend of data and discards any data that strays and it’s able to target anomalies even when inserted as a group.(Understanding what the two distinct methods are here is a bit unclear.) In general, data sanitization techniques use algorithms to detect anomalies and remove any suspicious points that may be poisoned data or sensitive information.

Furthermore, data sanitization methods may remove useful, non-sensitive information, which then renders the sanitized dataset less useful and altered from the original. There have been iterations of common data sanitization techniques that attempt to correct the issue of the loss of original dataset integrity. In particular, Liu, Xuan, Wen, and Song offered a new algorithm for data sanitization called the Improved Minimum Sensitive Itemsets Conflict First Algorithm (IMSICF) method. There is often a lot of emphasis that is put into protecting the privacy of users, so this method brings a new perspective that focuses on also protecting the integrity of the data. It functions in a way that has three main advantages: it learns to optimize the process of sanitization by only cleaning the item with the highest conflict count, keeps parts of the dataset with highest utility, and also analyzes the conflict degree of the sensitive material. Robust research was conducted on the efficacy and usefulness of this new technique to reveal the ways that it can benefit in maintaining the integrity of the dataset. This new technique is able to firstly pinpoint the specific parts of the dataset that are possibly poisoned data and also use computer algorithms to make a calculation between the tradeoffs of how useful it is to decide if it should be removed. This is a new way of data sanitization that takes into account the utility of the data before it is immediately discarded.

First paragraph is good, needs some clarification in the description of what the two methods are. Second paragraph is great, you went into a good amount of detail, maintained an objective tone and described a researched method.

Methods of Data Sanitization

An important goal of PPDM is to strike a balance between maintaining the privacy of users that have submitted the data while also enabling developers to make full use of the dataset. Many measures of PPDM directly modify the dataset and create a new version that makes the original unrecoverable. It strictly erases any sensitive information and makes it inaccessible for attackers.

One type of data sanitization is rule based PPDM that uses defined computer algorithms to clean datasets. Association rule hiding is the process of data sanitization as applied to transactional databases. Transactional databases are the general term for data storage used to record transactions as organizations conduct their business. Examples include shipping payments, credit card payments, and sales orders. This source analyzes fifty four different methods of data sanitization and presents its four major findings of its trends

Certain new methods of data sanitization that rely on machine deep learning. There are various weaknesses in the current use of data sanitization. Many methods are not intricate or detailed enough to protect against more specific data attacks.(Such as?) This effort to maintain privacy while dating important data is referred to as privacy-preserving data mining. Machine learning develops methods that are more adapted to different types of attacks and can learn to face a broader range of situations. Deep learning is able to simplify the data sanitization methods and run these protective measures in a more efficient and less time consuming way.

There have also been hybrid models that utilize both rule based and machine deep learning methods to achieve a balance between the two techniques.

Overall a good section. Goes into detail about some applications, how methods are made more efficient, maintains objective tone and provides good information. Sourcing would be my only concern.

Overall, I really like the contributions you've made on this article. It's clear, succinct, provides facts and maintains the objective tone of Wikipedia. Great job!

Denmum (talk) 04:23, 19 April 2021 (UTC)

Week 10 Peer Review

Overall, I really like what you have, and the article's tone feels very reminiscent of a Wikipedia article. However, a few things - I think you should definitely incorporate citations as you go, since it might get difficult to keep track of them once your article is complete. Next, I think you should add hyperlinks for any technical terms - readers might not be familiar with some of the terminology you're using, so linking other Wikipedia articles (if they exist) could help increase the clarity of your article. Otherwise, the information you've presented is excellent and very in depth!

Redpandafan (talk) 00:44, 24 April 2021 (UTC)

Week 10 Peer Review

Hello Thisismyusername31,

This is a very detailed, well-developed, and organized article, so good job! One thing I noticed is that you will want to add citations to each piece of information drawn from your sources. The lead section seems to be a bit long; you might want to consider paring it down or separating it into two paragraphs. For "Applications of Data Sanitization," you might consider defining terms like "blockchain" or "supply chain management systems." You would want to write this article for someone who has no prior knowledge of your topic and is using your page as an introduction, so you might want to add some more depth and explanation to the terms and concepts you describe. For example, I think you could explain the Improved Minimum Sensitive Itemsets Conflict First Algorithm (IMSICF) in more detail.

Overall I think the article is really easy to read and covers some great information. Good work!

Casademasa (talk) 02:34, 25 April 2021 (UTC)

Week 11 peer review from junior leadership team

Congrats on moving the article to the mainspace! It is great that you have now added 20+ annotations to your article. I noticed you added more detail about specific methods to the applications section, which is great and echoes what I and your peers have suggested. Now the article is in really good shape! Exploredragon (talk) 10:41, 1 May 2021 (UTC)

Week 11 Peer Review - Denmum

Hi Thisismyusername31!

Great job on your article and moving it to the mainspace! You've done some great work that is concise, clear, maintains the tone of Wikipedia and brings something fresh to the table! Here are my thoughts for each section:

  • Introduction: Great introduction, clear, concise and maintains an objective tone.
  • Clearing devices: Really interesting and insightful pieces! This section is clear, concise, maintains an objective tone, is sourced, and has good flow.
  • Necessity of data sanitization: Good!
  • Applications of data sanitization: Good section, concise, clear and objective tone. Would recommend sourcing the paragraph describing the MR-OVnTSA approach but otherwise, great job!
  • Risks posed by inadequate sanitization: Great section on the risks! Would recommend sourcing the last paragraph on the Improved Minimum Sensitive Itemsets Conflict First Algorithm (IMSICF) method but otherwise, it's concise, clear and objective!

Made two edits that you can find in the Peer Review page in your sandboxes, but overall this is great work! Denmum (talk) 01:02, 2 May 2021 (UTC)

Week 11 Peer Review

Hi Thisismyusername31!

I really like this article. It looks like you're more or less on the right track to being done, so it's hard to make many criticisms. I'd say the main thing you should do is add more citations, especially in the intro since it lacks any.

I'd also think about merging the sections "Applications of data sanitization" and "Industry specific applications" since they're more or less about the same topic. The latter could become a subsection of the merged section. I think it would make more sense because the latter section is so brief and itself only has one subsection.

You could also consider cutting down on the length of the section titles. The longer they are, the more specific they are, and as a result your article will have to accommodate more sections just to cover all the potential related topics future Wikipedia editors might add to the article. I think sections like "Risks posed by inadequate sanitization" could just be "Risks." It wouldn't lose much; you could just explain specifically what risks you mean, in this case the risks of not sanitizing properly.

Overall, great article and pretty much ready for Wikipedia! I would think about some of the changes I suggested, but regardless I think you've done a good job.