Natural language processing
Definition and introduction
“The value to our society of being able to communicate with computers in everyday “natural” language cannot be overstated. Imagine asking your computer “Does this candidate have a good record on the environment?” or “When is the next televised National League baseball game?” Or being able to tell your PC “Please format my homework the way my English professor likes it.” Commercial products can already do some of these things, and AI scientists expect many more in the next decade. One goal of AI work in natural language is to enable communication between people and computers without resorting to memorization of complex commands and procedures. Automatic translation—enabling scientists, business people and just plain folks to interact easily with people around the world—is another goal. Both are just part of the broad field of AI and natural language, along with the cognitive science aspect of using computers to study how humans understand language.” (Natural Language Processing,n.p.)
Advancements in technology and education have brought numerous changes and improvements in the fields of Artificial Intelligence and Linguistics. Natural language processing is a part of linguistics and artificial intelligence (the study of making intelligent machines).
Natural Language Processing (NLP) can be defined as “a range of computational techniques for analyzing and representing naturally occurring text (free text) at one or more levels of linguistic analysis (e.g., morphological, syntactic, semantic, pragmatic) for the purpose of achieving human-like language processing for knowledge-intensive applications.” (AHIMA e-HIM Work Group, n.p.)
Through natural language processing, natural human language is converted into formal language that computers can understand and use easily. Natural languages are the languages that people speak, such as English and Japanese, whereas artificial languages include programming languages and formal logic.
NLP involves the creation of software that understands human languages and transforms them into representations that computer programs can use. This is not an easy task, because it requires command of the meaning behind each word or phrase. After extensive research, scientists have made it possible to convert natural language into symbolic representations that computers can process. However, no system yet understands language as fully as a person does. English speakers, for example, often shorten words. Humans can usually guess the intended meaning of such short forms, but computers find this difficult. For example, the word ‘plane’ has several meanings: if someone writes ‘plane’ for ‘aeroplane’, a computer system cannot easily tell which sense is intended. Careful engineering and more advanced technology have nonetheless made it possible for systems to respond to natural language, which is the language of humans. Applications include:
· Text critiquing
· Information retrieval
· Question answering
· Different game settings
· Summarization of text
New advancements in computer technology have made it possible to check grammar in different languages like English, German, Spanish, and French. Nowadays, this technology is also used in cell phones and in the formation of language conversion and translation software. With the help of such software, it is possible for humans to use computers in any area or any country of the world easily in their own language.
This study of language communication made it possible to set up languages for computers. It focuses on questions such as: what did the person say, what does it mean, is a word similar to other words, what is the relationship between two words, and what does a certain word mean in the context of the whole topic? The study of natural language is therefore concerned with the rules used to understand phrases and clauses and to transform them into a computer-understandable form. One way to do this is with an augmented transition network (ATN), the formalism also used in LUNAR. An ATN can categorize words according to different types of grammatical structure. It can transfer control to other networks and can perform many tasks in analyzing text. Turning a person’s actions and representations into language is much harder, but it is possible to design software that performs such transformations in order to respond to different situations.
A significant amount of research in other areas of linguistics has addressed the conversion of language into different forms and symbols. Nearly three decades of advances in computer science involved programming computers to handle linguistic objects. Subsequent changes in computer programming led to the start of natural language processing. For example, counting the occurrences of particular letters and digits in a piece of writing by hand is difficult and time-consuming, whereas a computer can count them in seconds. Computers were therefore first used to compute statistics such as the frequency of occurrence of a particular word. After that, significant work was done on language and computation using statistical data. This is known as computational linguistics. After the success of these first linguistic applications, the next step was the development of feasible machine translation software. This attracted great interest in the United States and elsewhere, where there had already been investment in linguistic computation.
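The counting task described above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample sentence are invented for the example, and the tokenization rule (runs of letters and apostrophes) is a simplifying assumption.

```python
import re
from collections import Counter


def count_words(text):
    """Return a Counter mapping each lowercase word to its number of occurrences."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def count_characters(text):
    """Return a Counter of letters and digits, ignoring spaces and punctuation."""
    return Counter(ch for ch in text.lower() if ch.isalnum())


# Hypothetical sample text, used only to demonstrate the two functions.
sample = "The cat sat on the mat. The mat was flat."
word_counts = count_words(sample)
char_counts = count_characters(sample)
print(word_counts.most_common(2))  # the two most frequent words with their counts
print(char_counts["t"])            # number of times the letter 't' occurs
```

Frequency tables of this kind are the raw material for the statistical methods discussed later in the text.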
Machine translation must handle the meaning of a text as well as its words. It remained popular for many years. Early systems were implemented as programs that translated word for word. Programming languages of the time presented many difficulties. Common languages were APL, PASCAL, and LISP, which helped programmers to work with fewer problems.
A major change came with Winograd’s SHRDLU program, which was introduced in 1971. The program was written in LISP, which served as the language of artificial intelligence for researchers in the United States. Later, scientists conducted further research on natural language processing, and significant progress followed when the languages PROLOG and OPS5 were introduced. These languages are easier to work with than their predecessors. They offer more support for grammatical structure and can be applied to tasks such as medical diagnosis and other statistical and mathematical problems.
Even in a programming language like LISP, which is not inherently declarative, it is possible and effective to use declarative programming techniques by representing facts and rules as data structures and by implementing various kinds of interpreters or compilers for using information in particular tasks. This approach is widely regarded as the most appropriate one for modern NLP. It means that there can be a much closer dialogue than was previously possible, between NLP researchers, on one hand, and those few theoretical linguists with an interest in grammar formalism design, on the other hand.
The idea of artificial language processing was very helpful for research and development in natural language processing. Artificial language processing itself, however, was used in compilers and interpreters. With improvements in natural language processing, handling grammar, speech, conclusions, footnotes, and meaning has become easier. The term now used for the computational processing of language is natural language processing. Much of the work on natural language processing has been done on English, because English is widely spoken in many countries. Another reason is that English-speaking researchers in America and Europe did most of the research. Three things are very important in natural language processing:
The field of computational linguistics was established in the 1950s in America, with the aim of enabling computers to translate text from other languages into English. Translation requires an understanding of both languages, which is why much development also took place in linguistic studies. Translating between languages requires the grammars of both, including the grammar of word formation, the similarity of a word to other words, and its meanings. Computational linguistics is no longer a single field of study. With significant research, it has been divided into subfields. The two subdivisions made at that time were parsing and generation, which deal with taking language apart and putting it together, respectively.
TASK AND LIMITATIONS
When people talk about the relationship between humans and computers, natural language processing is one of the best-known means by which such a relationship develops. As mentioned in the discussion of origins, SHRDLU was an early natural language program; it was very small and had limited capacity. It operated in a restricted ‘blocks world’ and could not handle much uncertainty. Its grammatical coverage was also limited. Such systems are no longer used, because significant progress has been made in building more technologically advanced systems for the real world. Today, computer systems are built with more realistic approaches and with a greater capacity for understanding language. Many language problems remain hard to describe, and solving a large problem still involves major problems and sub-problems. The task is related to understanding human intelligence and comparing it with computer programming. It involves Artificial Intelligence (AI), and it is among the most difficult tasks in the field, because the underlying problem is that of making computers as intelligent as people. This idea was described by Fanya Montalvo.
These problems are understandable for people because they know even complex methods to combine various languages. It is hard for computers to do the same. These involve problems like:
1. Computer vision:
This problem concerns the use of digital computers to interpret visual information by assigning it to categories.
2. Natural language understanding:
It is the analysis and generation of text by computer, which is the concern of natural language processing. One task for researchers is to make computers understand different languages. When information is provided to a computer in a particular language, the computer should give its results in the same language, since computers can in principle work with any language system. This is called user-computer interaction. The first task in setting up the problem is to transform natural language into formal language. To go further and manipulate data, the program must handle tables, graphics, factual data, reasoning, and responses in natural language. The task also includes summarizing and translating text. Some of the most advanced techniques used for dealing with different facets of language involve:
· Discourse context,
One of the most important techniques in natural language processing today is syntactic analysis. Semantic analysis makes it possible to extract information from the text provided. More work is needed on languages other than English, as most of the work on these phases has been done in English. Conversation and understanding in other languages are also necessary, and further research is needed for more progress in natural language processing.
“Windows needs to go more modular,” says Michael Silver, a Gartner analyst, on Tuesday. “Microsoft needs to be far more nimble with both Windows and Office.” Steve Ballmer said Windows is a “very long-lived platform” which will continue to evolve, drawing in even more features, such as natural language voice recognition. He intimated that Windows would become bigger, not smaller. “We’ve got a very long list of stuff our engineers want to do, a long list of stuff all of the companies here want us to do.”
Many people think that natural language processing does little for human-to-human relationships, since natural language interfaces concern the human-to-computer relationship. This is because computers operate much faster than humans do. It is not practical to take in information from computers when they process it faster than humans can. The user’s satisfaction with any system varies with how well the system handles variability in the input. Therefore, unrestricted systems are more beneficial for users than restricted ones.
Because these systems rely on advanced technology and extensive knowledge, they are not easy to handle. Many complexities need proper solutions; otherwise, the whole effort is lost. Some of the concrete problems associated with natural language processing are as follows:
The problem of written structure is experienced in processing grammatical information when two sentences are written with similar grammatical structure but different meanings. These sentences will not be understandable for a person unless he properly studies the behaviors of the things discussed in the sentences and the meanings of words used in those sentences.
The problem of interpreting a single sentence can come in many different forms as one sentence can have many writing styles. These sentences, wherever they occur, are based upon the parts of speech. Therefore, if a user does not have knowledge of the parts of speech, he cannot easily detect the problem(s) in a sentence. Ultimately, he will not understand whether sentences are written in similar styles or different ones.
Adjectives are used in English language and many other languages. Extraction of adjectives is not easy for users who are unfamiliar with the parts of speech.
The expression matching problem is another concrete problem for computer systems. It concerns the mapping of specific phrases. For example, a sentence may contain a scientific formula that uses the names of chemical substances, which are always written in short forms. It is therefore hard to distinguish such terms from normal words and abbreviations.
In language processing, it is hard to determine where the subtraction, addition, and other signs occur in a formula. Checking the sentence before rewriting it helps to decrease the number of errors.
Procedural descriptions, representative figures, and tables, which are collections of numbers and descriptions, are hard for language processing systems to understand.
Interaction between syntax and semantics is necessary for every system that transforms language. The identification of concepts is also important.
Besides these concrete problems, there are also some sub-problems that arise from them. These are smaller problems, but they are noteworthy for computer programmers. They include the following:
In spoken language, successive words often run together with no audible boundaries between them. Natural language processing software needs such boundaries in order to recognize and process each individual word. It is also very difficult for the computer to decide where a sentence ends and where grammatical changes are necessary.
For languages whose text is written without spaces between words, such as Chinese, it is difficult to decide where words and sentences end. Processing requires extracting individual words from the sentence, which is hard in such languages. It is therefore difficult to determine the meaning of each specific word.
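One simple way to approach the word boundary problem in unspaced text is greedy "maximum matching" against a word list. The sketch below is a minimal illustration, not a production segmenter: the function name `max_match`, the `max_len` parameter, and the tiny vocabulary are all assumptions made for the example, and real systems rely on much larger dictionaries or statistical models.

```python
def max_match(text, vocabulary, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest vocabulary entry that matches;
    if nothing matches, fall back to a single character.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary, invented for this example.
vocab = {"我", "喜欢", "自然", "语言", "自然语言", "处理"}
segments = max_match("我喜欢自然语言处理", vocab)
print(segments)
```

Because the method always prefers the longest match, it can choose the wrong split when a long dictionary word overlaps two shorter intended words, which is one reason statistical segmentation is often preferred.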
Many words have several meanings. When a word has multiple meanings, it is difficult for a system to decide which meaning fits best in a given place. Natural language processing needs the relevant meaning of each word with respect to its context.
If there are many words from foreign languages in a speech, it is difficult for a language processor to understand such words. Similarly, extensive use of local abbreviations makes the job of language processors difficult.
Speech acts and plans add complexity to understanding an utterance.
Finding links between specific words is very difficult without full command of the language. Full command of any language requires command of its words and their meanings with respect to grammatical rules.
Competence errors and performance errors occur in the use of language, and systems that aim to understand language must deal with them.
Natural language database systems can make use of syntactic knowledge and knowledge of actual databases to relate natural language input to the structure and contents of that database. Of course, the system will expect users to ask questions pertaining to the domain of the database, which in turn represents some aspects of the real world. Syntactic knowledge usually resides in the linguistic component of the system, in particular in the syntax analyzer whereas knowledge about the actual database resides in the semantic data model used.
These were some of the problems associated with the processing of language. Natural language processing requires vast experience, extensive research, and communication techniques in all languages to make a system user-friendly and helpful in every language. In the past, American researchers did most of the work in making systems more understandable. They also worked on languages other than English, but this required knowledge and research that they lacked.
STATISTICAL NATURAL LANGUAGE PROCESSING
Natural language processing can be made easier through statistical approaches. Part-of-speech tagging, language modeling, prepositional phrase attachment, spelling and grammar correction, and word segmentation are key applications of such approaches. Many statistical models are available for language tasks in artificial intelligence. These include probabilistic context-free grammars.
Nowadays, statistical methods for natural language processing are common among researchers. They draw on elementary statistics, conditional probability, and joint probability, and are stochastic and probabilistic in nature. They are applied to longer, complex sentences as well as to simple ones, using corpora and Markov models. ‘Corpora’ is the plural of ‘corpus’, a large collection of texts; corpus linguistics is the study of language through such collections. A corpus may be annotated with parts of speech, since many words belong to more than one part of speech and have different meanings. These computational methods are considered among the best research methods in linguistics, and they have made machine translation possible at a high level. The Markov chain is a well-known model in mathematics. It describes a system that changes state at random, where the probability of the next state depends only on the current state. Statistical work and Markov models have given researchers significant knowledge for building natural language processing programs. Statistical language modeling is therefore an important topic in linguistics. The study of Markov chains includes:
· Steady-state analysis and limiting distributions
· Markov chains with a finite state space
· Markov chains with a general state space
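To make the Markov idea concrete for language, the following minimal sketch treats each word as a state and estimates the probability of the next word from counts of adjacent word pairs (a bigram model). The three-sentence corpus, the function names, and the `<s>` and `</s>` boundary markers are assumptions made for the illustration; no smoothing is applied, so any unseen word pair gives a probability of zero.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Estimate P(next_word | current_word) by counting adjacent word pairs."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for current, following in zip(words, words[1:]):
            pair_counts[current][following] += 1
    model = {}
    for current, followers in pair_counts.items():
        total = sum(followers.values())
        model[current] = {w: c / total for w, c in followers.items()}
    return model


def sentence_probability(model, sentence):
    """Multiply the transition probabilities along the sentence (0.0 if a pair is unseen)."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    probability = 1.0
    for current, following in zip(words, words[1:]):
        probability *= model.get(current, {}).get(following, 0.0)
    return probability


# Hypothetical toy corpus, invented for this example.
corpus = ["the dog barks", "the cat sleeps", "the dog sleeps"]
model = train_bigram_model(corpus)
print(model["the"])                                   # transition probabilities out of "the"
print(sentence_probability(model, "the dog sleeps"))  # a word sequence supported by the corpus
print(sentence_probability(model, "the cat barks"))   # contains the unseen pair "cat barks"
```

The zero probability for the last sentence shows why practical systems add smoothing: a grammatical sentence should not be ruled out merely because one word pair never appeared in the training corpus.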
Statistical natural language processing is, in my estimation, one of the most fast-moving and exciting areas of computer science nowadays.
Markov theory draws on probability theory and related mathematical tools. These are useful in language conversion, since systems also need statistical methods to handle symbols, formulae, and abbreviations. Statistical methods also include hidden Markov models. Besides Markov models, there are other well-known models for statistical natural language processing, which modern researchers use to make artificial intelligence easier to build and understand. Written information takes many forms. People sometimes use symbols specific to a language, so researchers who want to design comprehensive language conversion software must know the corresponding symbols in other languages. Machine translation involves both natural language processing and empirical linguistics. It is now possible to transform the information structure of one language into another while preserving the correct meaning. European countries are carrying out more research on the development of natural language processing software.
Statistical tools are very important for natural language systems: for understanding a sentence clearly, for choosing correct language settings, for finding the meaning closest to a word, and for matching symbols. Nowadays, such systems involve many of the statistical methods mentioned above. People have become more aware of statistical knowledge. Moreover, researchers have worked to find easier ways of analyzing sentences, and these methods are mathematical or statistical.
MAJOR TASKS IN NATURAL LANGUAGE PROCESSING
Some initial tasks in natural language processing were discussed above. Besides these, there are also some major tasks for natural language processing which are associated with the transfer of natural language to a formal one. These involve the following:
Software can produce shorter versions of a text to make information quicker to read and more user-friendly. This procedure is known as summarization or automatic summarization. A computer system may be able to shorten or summarize a description in any language, and the summary should be coherent. The result of this process includes the most significant points of the original text. As the amount of data that people can easily retrieve has grown, the importance of automatic summarization has increased. Search engines, which show short extracts of the pages they find, are good examples of automated summarization software. Any software that produces a logical summary must consider features such as length, style, and syntax to create a good summary.
There are two main approaches to summarization: extraction and abstraction. Extraction means that the computer copies the data it judges most significant and adds them to the summary. Such important data may include statistics, keywords, and abbreviations that occur frequently in the text. Abstraction, on the other hand, means expressing portions of the original text in other words. Abstraction can therefore condense text more effectively than extraction. However, such programs are more difficult to develop, as they require expertise in natural language generation technology. Machine learning procedures are used to achieve automated summarization, and information retrieval and text mining techniques are used for this purpose.
Two types of software can be developed to summarize information. The first one may summarize the whole information without any assistance from humans. This is an extremely difficult task. On the other hand, the second type of summarization software can be those that operate with the help of humans. Such software may do partial work regarding summarization and humans can finish the work afterwards. For example, software can highlight the information that seems important, and it can insert synonyms of the highlighted words. Later on, humans can decide whether those synonyms are correctly placed or not. Humans can accept automatic insertions of synonyms or reject them.
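The extraction approach described above can be sketched with a simple word-frequency heuristic: sentences that contain the most frequent content words are copied into the summary. This is a minimal sketch under stated assumptions, not the method of any particular product. The stopword list, the sentence splitter, and the function names are invented for the example, and the scoring favors longer sentences.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it", "are", "on"}


def content_words(text):
    """Lowercase alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, max_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    frequencies = Counter(content_words(text))

    def score(index):
        return sum(frequencies[w] for w in content_words(sentences[index]))

    ranked = sorted(range(len(sentences)), key=score, reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore document order
    return [sentences[i] for i in chosen]


# Hypothetical sample text, invented for this example.
sample = ("Parsers analyze sentences. Statistical parsers learn from large corpora. "
          "The weather is nice.")
print(summarize(sample, max_sentences=1))
```

A human-assisted tool of the second kind mentioned above could use the same scores to highlight candidate sentences and leave the final selection to the user.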
A language aid is provided in computer programs to help those who have little knowledge of English or other languages. Provision of formal language aid is a major task at this time. It helps those who do not have knowledge of foreign languages. Language aid software can be used for two purposes: reading aid and writing aid. Software that aids reading can be used to improve pronunciation and oral comprehension skills. Moreover, users can adjust their tone and pitch to make their speech more like that of native speakers.
On the other hand, writing software can help users to improve their grammar in a particular language. Users can also improve their vocabulary if such software provides a dictionary or a thesaurus. Furthermore, writing aid software can make on-the-spot corrections. When such software makes the same on-the-spot correction many times, the writer quickly learns that particular grammar rule.
Information extraction is necessary in language settings. This is linked to a wider field known as information retrieval. In natural language processing, information extraction makes users’ work easier by categorizing data contextually and semantically. Nowadays, because of scientific advancements, information is also growing in unstructured forms. A good extractor is needed to give structure to such unstructured information.
Because of the volume of written material now produced, extensive search is necessary. Therefore, natural language processing must be capable of searching. This searching may be carried out online or offline. Automated information retrieval (IR) software decreases the burden of searching and managing too much information. Many academic institutions and libraries use such software to provide access to databases that contain journals, books, and other documents. This software involves the concepts of query and object. Queries are formal statements of data requirements that users enter into IR software. An object is a unit that stores data in a database. If a user’s query matches objects stored in the database, the user sees them as results.
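The matching of queries against stored objects can be illustrated with a small inverted index, a data structure that maps each term to the documents containing it. The sketch below is illustrative only: the document titles, the identifiers, and the function names are hypothetical, and the matching rule (every query term must occur in the document) is one simple choice among many.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical library catalogue, invented for this example.
library = {
    "doc1": "Journal of computational linguistics",
    "doc2": "Handbook of natural language processing",
    "doc3": "Natural language and computational semantics",
}
index = build_index(library)
print(sorted(search(index, "natural language")))
print(sorted(search(index, "computational linguistics")))
```

Real retrieval systems usually rank results by relevance rather than returning an unordered set, but the query-to-object matching step follows the same idea.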
Machine translation is another task, in which software translates text from one natural language to another. Simple forms of machine translation software translate each individual word into the other language. Complex machine translation software, on the other hand, tries to understand phrases and idioms before translating them. Today, machine translation is a major task, as such systems are used all over the world. Knowledge of a language is still necessary to judge whether machine-translated text is correct, and programmers are needed to ensure that the correct meaning of a word is carried over into other languages.
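The simple, word-by-word form of machine translation can be sketched as a dictionary lookup. The toy English-to-Spanish lexicon and the function name below are invented for illustration; the second call shows how an idiom is rendered literally, which is the weakness that phrase-aware systems try to address.

```python
def translate_word_by_word(sentence, lexicon):
    """Replace each word using the lexicon; unknown words are left unchanged."""
    return " ".join(lexicon.get(word, word) for word in sentence.lower().split())


# Toy English-to-Spanish lexicon, invented for this example.
lexicon = {
    "the": "el", "cat": "gato", "drinks": "bebe", "milk": "leche",
    "it": "ello", "rains": "llueve", "cats": "gatos", "and": "y", "dogs": "perros",
}

literal = translate_word_by_word("the cat drinks milk", lexicon)
idiom = translate_word_by_word("it rains cats and dogs", lexicon)
print(literal)  # a plain sentence survives word-by-word translation
print(idiom)    # the idiom comes out as a literal, unidiomatic string
```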
Entity identification software is used to categorize text into predefined categories. For example, such software can distinguish names of countries and states from names of persons. Similarly, such software can tell whether a number represents a monetary value or a percentage by identifying currency symbols ($/€) or the percent sign (%). Moreover, abbreviations can be easily described through language processing using such entity identification software.
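The symbol-based part of entity identification can be sketched with regular expressions. This minimal example only recognizes monetary values and percentages; the labels `MONEY` and `PERCENT`, the patterns, and the sample sentence are assumptions made for the illustration. Recognizing names of people or countries generally needs word lists or trained models instead.

```python
import re

# Each pattern is a simplifying assumption, not an exhaustive definition.
PATTERNS = [
    ("MONEY", re.compile(r"[$€]\s?\d+(?:[.,]\d+)*")),   # currency symbol, then a number
    ("PERCENT", re.compile(r"\d+(?:\.\d+)?\s?%")),      # number, then a percent sign
]


def tag_entities(text):
    """Return (label, matched_text) pairs in order of appearance in the text."""
    found = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(found)]


# Hypothetical sample sentence, invented for this example.
entities = tag_entities("Sales rose 12.5% to $3,400 while costs fell 4 % to €900.")
print(entities)
```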
Natural Language Generation (NLG) is another major task in NLP that is used to create natural language from a computerized system. In natural language generation, the software has to decide how it will transform an idea into a complete sentence. Natural language understanding may be used in such software to understand the information and collect it in any language.
Extraction of text from images is also a major task in language processing. Optical character recognition is now commonly used to read text in images. Character recognition software is used for this purpose. Images of text can be scanned, recognized, and converted into editable text by such software. This saves time that might have been spent on retyping a text that is present in hard copy but not in an electronic copy. Moreover, there may be images in an electronic copy that cannot be edited because they are saved in an image format. Using character recognition software, text can be extracted and saved in a text format that can be edited using other software like MS Word.
There is a need for software that is capable of producing answers in any language. Natural language processing must include a question answering system for each language that it deals with, directly or through a language aid. A question answering (QA) system can be classified as a part of information retrieval. Using a database of documents, the QA system can provide answers to questions asked in natural language. QA software requires more advanced NLP techniques than other forms of information retrieval. Moreover, it goes beyond the concept of search engines, as QA software searches for precise answers whereas search engines return lists of documents that may or may not contain the answer. The databases used for this purpose have no limit. They can be an offline library or a collection of online libraries and the World Wide Web.
Speech recognition or voice recognition is a noteworthy part of natural language processing. This involves conversion of voice signals into words using algorithms designed in software. Every person has his own way of speaking words. Therefore, when spoken words are not recognized clearly, it may be because the speaker’s accent is different. It is hard for the system to find the correct word if the sound of a word does not match the sounds present in the software’s database.
Simplification and summarization of the text is also one of the major tasks. Text simplification refers to the conversion of complex sentences in such a form that the meaning remains the same but the sentence structure is made easier to comprehend.
The use of a speech synthesizer (for text-to-speech conversion) is also beneficial, as it transforms text into speech similar to a human voice. People who have visual impairments use it.
Language processing also addresses the task of proofreading. This ensures that there are no typing mistakes made unintentionally.
These were some of the major tasks that need to be fulfilled in this advanced era as computer usage increases. To reach high standards in these tasks, one should think about the areas of natural language processing that will have more value in the future. As technology grows, research also needs to advance. Making user-friendly software is the most important thing that matters to everybody.
The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system, in order to determine if (or to what extent) the system answers the goals of its designers, or the needs of its users. Research in NLP evaluation has received considerable attention, because the definition of proper evaluation criteria is one way to specify precisely an NLP problem.
Natural language processing is not only a modern computing technology but also a means of evaluating claims about human language. As mentioned above, it concerns the interaction between computerized systems and natural human language, and it is one of the best ways to link humans and computers. Through this link, software can be built that transforms one language into another. Most of the research on natural language processing has been done in English because the researchers were mostly Americans or English-speaking Europeans. Human behavior and speech differ from those of a computer, so significant research and expertise are needed to make good natural language processing software. The main advantage of a computer is that it works much faster than a human and needs no mental or physical rest. However, since it works automatically, it cannot think as humans do. This is where problems arise. A computer works within the limits set in advance by software programmers and designers, and it does not know how to deal with a situation that its system does not already cover.
Natural language information processing (NLIP) has made significant progress, in important ways, in the last twenty years. We have developed fairly comprehensive and robust tools like grammars and parsers, and have gained experience with applications including multilingual ones. We have been able not only to take advantage of the general advancements in computing and communications technology but, more significantly, to exploit by-now vast text corpora to adapt our tools to actual patterns of language use.
Thus, it can be said that programmers will in future be able to write software that solves the problems currently faced. As we move further into the 21st century, we can expect to benefit from future research in the field of natural language processing all over the world.
AHIMA e-HIM Work Group on Computer-Assisted Coding. “Delving into Computer-assisted Coding. Appendix G: Glossary of Terms.” Journal of AHIMA 75, no. 10 (Nov-Dec 2004): web extra.
http://library.ahima.org/xpedio/groups/public/documents/ahima/bok1_025042.hcsp?dDocName=bok1_025042 Accessed May 29, 2007
Brill, E. 1995. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Computational Linguistics 21(4): 543-565. (http://portal.acm.org/citation.cfm?id=1117823)
Church, K. 1988. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. In Proceedings of the Second Conference on Applied Natural Language Processing, 136-143. Somerset, NJ. Association for Computational Linguistics. (http://www.netautopsy.org/natlngpr.htm)
Klavans, J., and Resnik, P., eds. 1996. The Balancing Act: Combining Symbolic and Statistical Approaches to Language. Cambridge, Mass.: MIT Press. (http://portal.acm.org/citation.cfm?id=1219915)
Langley, P., and Carbonell, J. 1985. Language Acquisition and Machine Learning. In Mechanisms of Language Acquisition, ed. B. MacWhinney, 115-155. Hillsdale, NJ. Lawrence Erlbaum.
McClelland, J. L., and Kawamoto, A. H. 1986. Mechanisms of Sentence Processing: Assigning Roles to Constituents of Sentences. In Parallel Distributed Processing, Volume 2, eds. D. E. Rumelhart and J. L. McClelland, 318-362. Cambridge, Mass.: MIT Press. (www.ai.mit.edu/projects/jmlr/papers/volume4/califf03a/source/local.bib)
Miikkulainen, R. 1993. Subsymbolic Natural Language Processing: An Integrated Model of Scripts, Lexicon, and Memory. Cambridge, Mass.: MIT Press. (www.cs.ucla.edu/~dyer/Papers/CAINSP/StatusPap95.html)
Mooney, R. J., and DeJong, G. F. 1985. Learning Schemata for Natural Language Processing. In Proceedings of the Ninth International Joint Conference on Artificial Intelligence, 681-687. Menlo Park, Calif.: International Joint Conferences on Artificial Intelligence. (http://portal.acm.org/citation.cfm?id=66445)
Natural Language Processing (n.d.)
http://www.aaai.org/AITopics/html/natlang.html Accessed May 29, 2007
Woods, W. A. 1977. Lunar Rocks in Natural English: Explorations in Natural Language Question Answering. In Linguistic Structures Processing, ed. A. Zampolli. New York: Elsevier.